Abstract

In my reanalysis of the publication of Heintz et al. (2017) I could identify a ca. 50% overlap of differentially expressed genes and a >50% agreement of targeted KEGG pathways. This was consitent across the four different contrasts for which the authors provided raw expression tables. The authors also compared the differential expression also on KEGG-pathway level for middle-age C. elegans (day 15). They could identify lysosome, peroxysome, drug metabolism, cytochrome P450 and fatty acid metabolism as the most enriched pathways in DR-animals compared to normal-lived wild type animals. I could confirm this enrichment both starting from the fastq files as well as from their supplied expression quantification data.

1 Introduction

The improvement of public health conditions at the beginning of the 20th century led to a steep and continuous increase of the average human lifespan. Medical breakthroughs eradicated infectious diseases, however, age-dependent chronic diseases (cancer, diabetes, cardiovascular and neurodegenerative illnesses) became the predominant cause of death. The main risk factor for developing these diseases is our age. The goal of aging research is to prolong the time we spend in good health as we grow older.

The rate at which organisms age and in turn become susceptible to age-dependent diseases is plastic. For instance, reducing the activity of one gene ( daf-2 / insulin receptor) in C. elegans is sufficient to double the lifespan of the animal (Kenyon et al. 1993). Reducing the food intake through dietary restriction (DR) is another mechanism through which animals can robustly achieve lifespan extension Weindruch and Walford (1988). A leading hypothesis in the field to explain aging is that aging is be caused by accumulation of cellular damage (López-Otín et al. 2013). Hence, longevity interventions likely mobilize mechanisms that repair damaged cellular components (Kenyon 2010, López-Otín et al. (2013)). The observations of Patro et al. (2017) that the global mRNA splicing efficiency was maintained better in long- compared to normal-lived C. elegans are in line with this hypothesis. In a nutshell, the selling point of the publication is that modulation of selected splicosome components increases lifespan and could promote healthy aging. The main component of the publication that was selected for re-analysis is the differential expression analysis between normal- and long-lived individuals - potentially containing information how health can be longer maintained in long-lived individuals.

1.0.1 Biological background underlying the RNA sequencing analysis:

To be able to study pre-mRNA splicing under dietary restriction condition Heintz et al. (2017) used the multicellular model Caenorhabditis elegans. To study dietary restriction the authors used for all RNA sequencing experiments an eat-2 C. elegans mutant strain that is genetically restricted in its food uptake. Since C. elegans is transparent they were able to use a fluorescent in-vivo alternative splicing reporter. By using two fluorophores, one positioned in frame and the other with a frameshift they were able to monitor exon skipping and thus, erroneous splicing (reporter gene: ret-1). They could observe that as the animal aged more erroneous splicing occured. Interestingly, under dietary restriction conditions older animals retained a more youthful splicing pattern - under dietary restriction the lifespan of C. elegans is extended (here: 65% increase, P < 0.0001). Furthermore, when the authors separated the normal-lived nematodes (no DR) based “youthful” or “aged” splicing pattern, the animals with a youthful signature outlived their isogenic clones (23% increase, P < 0.0001). This indicates that a tightly controlled splicing machinery is both required and even sufficient to increase lifespan in C. elegans.

Heintz et al. (2017) performed a targeted reverse genetic screen for spliceosome components that only affect lifespan in long-lived DR animals but not in wild type and identified the gene sfa-1. This gene encodes the branch-point binding protein splicing factor 1 (Krämer 1992). sfa-1 RNAi eliminates the lifespan extension of dietary restriction but does not affect the wild type. This finding suggests a link between sfa-1, mRNA processing and lifespan extension via dietary restriction in C. elegans.

The authors show:

  • The enriched KEGG pathways are significantly different between wild type and dietary restricted animals (analyzed at middle-age, day 15)

  • This change in pathway activation is dependent on sfa-1

  • sfa-1 RNAi does reverse all expression changes induced through dietary restriction. If combined with DR, sfa-1 RNAi treatment specifically changes the lipid and fatty acid metabolism back to levels observed in wild type.

The parts of the publication that were selected for re-analysis were highlighted above. This work by Heintz et al. (2017) is extremely rich on information and contains a plethora of experiments (wet lab and computationally) which I will not include. In summary, by quantifying splicing efficiency and by crossing their splicing reporter strain into different DR null mutants the authors could link SFA-1 to healthy aging via TORC1 suppression and dietary restriction.

1.0.2 RNA sequencing - study design

For this study 100 base pair paired-end RNA-seq samples were taken from C. elegans of different ages, feeding regimes and subjected to different RNAi treatments. Importantly, at all time points only life animals were harvested. Every sample contained > 500 animals and taken in four biological replicates. In brief, RNA extraction was performed using Qiazol reagent (QIAGEN), column purified (QIAGEN), cDNA synthesized using SuperScript (Invitrogen) and the library prepared according to the Nextera XT DNA library protocol (Illumina). Sequencing was performed using Illumina HiSeq2500 with 100-cycle paired- end sequencing.

Raw reads were processed in the following way: Adapters were removed using cutadapt (Martin 2011) using the parameters: “–trim-n -m 15”. Genome alignment was performed using STAR 2.5.0c (Dobin et al. 2013). Additional parameters used for alignment are: –alignIntronMax 50000, –outSAMstrandField intronMotif and –outFilterType BySJout for all alignments. htseq-count (Anders, Pyl, and Huber 2015) was employed to obtain gene counts and DESeq2 (Love, Huber, and Anders 2014) was used for gene expression analysis. The Benjamini–Hochberg method was used to correct for multiple testing (Benjamini and Hochberg 1995) and all genes with a an FDR-adjusted p-value > 0.1 were treated as differentially expressed. The differentially expressed genes were then analyzed with goseq (Young et al. 2010) for GO term enrichment and gage (Luo et al. 2009) for KEGG pathway enrichment analysis. As on the level of individual genes on the level of GO terms or KEGG pathways the significance threshold was defined at a Benjamini–Hochberg-adjusted P value < 0.1. The raw reads were made available on the ArrayExpress archive under the accession number E-MTAB-4866.

2 Re-analysis pipeline:

All steps performed in the re-analysis are described and commented as they are being used.

2.0.1 Quality check

To assess the quality of the RNA sequencing samples the program FastQC was utilized. No samples were excluded.

2.0.2 Trimming reads

The adapter sequences used for RNA-sequencing need to be removed before the reads can be aligned. For this the program cutadapt was employed. It can be obtained conveniently using the pip3 packet manager: sudo pip3 install cutadapt. Apart from read trimming no other manipulations were performed. Short reads introduced by the trimming process were not removed.

The raw fastq files are imported from a directory (“untrimmed”), the adapters removed from the paired end samples and then written to a new directory (“trimmed”) using the code below.

untrimmed_dir <- "/Users/cys/Desktop/STA426_project/untrimmed"
data_dir <- "/Users/cys/Desktop/STA426_project/trimmed"

files <-dir(untrimmed_dir)[grep("fastq",dir(untrimmed_dir))]
files <- files[1:length(files)]
f1 <- files[grep("_1",files)]
f2 <- files[grep("_2",files)]
name <- unname(sapply(f1,function(x){strsplit(x,"_")[[1]][1]}))

for (i in 1:length(f1)) {
  print(i)
  print(f1[i])
  print(f2[i])
  print(name[i])
  
  (cmd <- sprintf("cutadapt --trim-n -m 15 -o %s -p %s -a %s -A %s %s %s",
                paste0(data_dir, "/","trimmed_",f1[i]),
                paste0(data_dir, "/","trimmed_",f2[i]),
                "CTGTCTCTTATACACATCTCCGAGCCCACGAGA",
                "CTGTCTCTTATACACATCTGACGCTGCCGACGA",
                paste0(untrimmed_dir, "/",f1[i]),
                paste0(untrimmed_dir, "/",f2[i])
                ))
  system(cmd)
}

2.1 Salmon

Salmon is a new tool for the quantification of gene expression data developped by Patro et al. (2017). It can use fastq files as input and can accurately estimate gene expression within an extremely short timespan using the computational power of a personal computer. Unlike classical aligners Salmon is based on quasi-mapping and a two-phase inference method. In brief, quasi mapping is a surrogate for the actual alignment that is performed by traditional aligners (e.g STAR). In two steps, first the reference transcriptome is indexed using the transcriptome of selected organism (coding and non-coding RNA) and only has to be run once. The quantification process uses a two-phase inference procedure which includes a bias model to adjust for common technical RNA-sequencing artefacts like GC-content as well as sequence- and position-specific biases. It is important to note, that the quantification step can also be performed using reads that were traditionally aligned (BAM format) and is independent of Salmon’s internal quasi mapping step. Further information can be found at the combine labs website

2.1.1 Salmon workflow

The salmon workflow developped for this reanalysis is largely based on the exercise 9 of the STA426 course. Dr. Charlotte Soneson is acknowledged here for providing the exercise and very understandable explanation of its components.

2.1.1.1 Seting paths for the Salmon analysis

The directories for the input and output directories as well as the file path to the salmon program and libraries on the local machine are specified here and are used throughout the file.

data_dir <- "/Users/cys/Desktop/STA426_project/trimmed" #location of fastq files
output_dir <- "/Users/cys/Desktop/STA426_project/Salmon_output" #Here will be my output files
path_to_salmon <- "DYLD_FALLBACK_LIBRARY_PATH=/Users/cys/Desktop/STA426_project/Binaries/salmon/lib /Users/cys/Desktop/STA426_project/Binaries/salmon/bin/salmon"
## Create subfolders
system(sprintf("mkdir -p %s/RNAseq", output_dir))
system(sprintf("mkdir -p %s/RNAseq/reference_Celeg/index", output_dir))

2.2 Reference files

All reference files for C.elegans were downloaded from Ensembl. Here, reference sequencing data can be found for numerous organisms.

2.2.1 Reference C. elegans transcriptome

Here, the coding and non-coding reference files are merged yielding all available cDNA information in one file which can then be used by Salmon to align the reads to. This step only has to be run once.

(cmd <- sprintf("cat %s %s > %s",
                "/Users/cys/Desktop/STA426_project/Reference_FASTA/Caenorhabditis_elegans.WBcel235.cdna.all.fa", #coding RNA
                "/Users/cys/Desktop/STA426_project/Reference_FASTA/Caenorhabditis_elegans.WBcel235.ncrna.fa", #Non Coding RNA
                paste0(output_dir, "/RNAseq/reference_Celeg/Celeg_cdna.ncrna.fa")))
## [1] "cat /Users/cys/Desktop/STA426_project/Reference_FASTA/Caenorhabditis_elegans.WBcel235.cdna.all.fa /Users/cys/Desktop/STA426_project/Reference_FASTA/Caenorhabditis_elegans.WBcel235.ncrna.fa > /Users/cys/Desktop/STA426_project/Salmon_output/RNAseq/reference_Celeg/Celeg_cdna.ncrna.fa"
#system(cmd)

2.2.2 Index reference transcriptome for subsequent use with Salmon

In order to be used by salmons quantification algorithm the reference transcriptome needs to be indexed. This step only has to be run once.

(cmd <- sprintf("%s index -t %s -i %s -p 1",
                path_to_salmon,
                paste0(output_dir, "/RNAseq/reference_Celeg/Celeg_cdna.ncrna.fa"),
                paste0(output_dir, "/RNAseq/reference_Celeg/Celeg_cdna.ncrna.sidx")))
## [1] "DYLD_FALLBACK_LIBRARY_PATH=/Users/cys/Desktop/STA426_project/Binaries/salmon/lib /Users/cys/Desktop/STA426_project/Binaries/salmon/bin/salmon index -t /Users/cys/Desktop/STA426_project/Salmon_output/RNAseq/reference_Celeg/Celeg_cdna.ncrna.fa -i /Users/cys/Desktop/STA426_project/Salmon_output/RNAseq/reference_Celeg/Celeg_cdna.ncrna.sidx -p 1"
# system(cmd)

2.2.3 Quantify transcript abundances

The trimmed fastq files are quantified by Salmon. This process is a bottle-neck in the analysis takes ca 20 minutes for a paired end sample.

files <-dir(data_dir)[grep("fastq",dir(data_dir))]
files <- files[1:length(files)]
f1 <- files[grep("_1",files)]
f2 <- files[grep("_2",files)]
name <- unname(sapply(f1,function(x){strsplit(x,"_")[[1]][2]}))

for (i in 1:length(f1)) {

  (cmd <- sprintf("%s quant -i %s -l A -1 %s -2 %s -o %s -p 1",
                path_to_salmon,
                paste0(output_dir, "/RNAseq/reference_Celeg/Celeg_cdna.ncrna.sidx"),
                paste0(data_dir, "/", f1[i]),
                paste0(data_dir, "/", f2[i]),
                paste0(output_dir, "/RNAseq","/",name[i])))
  #system(cmd)
}
print("Salmon quantification has finished")

2.2.4 Generate transcript-to-gene mapping

The goal of this analysis is to perform a quantification estimation on the level of genes and not transcripts. This means that we have to link the different transcripts to the genes they originate from to be able to then quantify the gene expression. The quantification of all transcripts that originate from identical loci are combined to yield a gene expression quantification. This is a standard process, however, one has to be aware that in this step information is destroyed because processes like differential isoform usage are not accounted for when all transcripts are placed in a bin corresponding to the gene loci they originated from. If these processes are to be included this “gene-mapping” step should be omitted and a different subsequent pipeline chosen.

2.2.5 transcript-to-gene mapping

The readDNAStringSet function from the Biostrings package reads the transcriptome reference (fasta file) that was generated by merging the coding and non-coding transcripts. Subsequently, we generate a transcript to gene mapping by taking the transcript name and its corresponding gene name from the transcript description. This is then stored in a two-column dataframe as a lookup table for subsequent use (shown below).

suppressPackageStartupMessages(library(Biostrings))
cdna.ncrna <- readDNAStringSet(paste0(output_dir,"/RNAseq/reference_Celeg/Celeg_cdna.ncrna.fa"))

transl <- function(name_string) {
  tmp <- strsplit(name_string, " ")[[1]]  # segment the transcript description string at white space
  tx <- tmp[1]                            # select first element - the TRANSCRIPT NAME
  gene <- gsub("gene:", "", tmp[grep("^gene:", tmp)]) # extract the GENE NAME from the full description
  c(tx = tx, gene = gene)                 #  link the transcript to the gene name by creating a a two column translation table
}

tx2gene <- data.frame(t(sapply(names(cdna.ncrna), transl)), stringsAsFactors = FALSE)
rownames(tx2gene) <- NULL #remove the very long gene description names - we have now the consise transcript to gene mapping

head(tx2gene)

2.3 Differential gene expression

Generate all file paths to to the Salmon quantification files (qunat.sf) of this experiment.

dirs <- list.files(paste0(output_dir,"/RNAseq"), full.names = TRUE)
dirs <- dirs[-grep("reference",dirs)] # add this in order to skip the reference file
files <- paste0(dirs, "/quant.sf") # file paths to the quantification for every read pair
names(files) <- basename(dirs)

2.3.1 Expression matrix assembly

Assemble the individual quantification files into an overall quantification matrix. The columns are the individual runs (each two paired-end fastq files) and the rows are the genes to which the transcripts match. The package reader is used to speed up the file import. The tximport function of the tximport package imports the transcript abundances that were generated using various software (here Salmon specified) into a user defined format. By specifying we the tx2gene argument the transcript level data is summarized at gene level using the lookup table. Setting dropInfReps to TRUE is needed to not skip the reading of inferential replicates (estimated counts) which are needed to process Salmon quantification data. The inferential replicates are then summarized into a single matrix containing the variance per transcript and per sample.

suppressPackageStartupMessages(library(tximport))
suppressPackageStartupMessages(library(readr)) # to speed up the importing process
txi <-tximport::tximport(files = files, type = "salmon", tx2gene = tx2gene, dropInfReps = TRUE)
## reading in files with read_tsv
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 
## summarizing abundance
## summarizing counts
## summarizing length
txi$counts[1:5,1:5]
##                ERR1474664 ERR1474665 ERR1474666 ERR1474667 ERR1474668
## WBGene00000001   655.7529    565.423   847.1810  1436.1710    744.003
## WBGene00000002    81.0000     17.000    33.5132    66.4820     36.000
## WBGene00000003   349.0000    123.000   221.0000   513.0000    425.000
## WBGene00000004   110.0000     83.000    61.0000   246.9998     99.000
## WBGene00000005    40.0000     14.000    68.0000   111.9996      2.000

The final count matrix has 46778 genes (rows) and 32 samples (columns)

2.3.2 Converting the expression matrix to a DGEList object

The expression matrix is converted to a DGEList object. The object holds information on the counts, samples and genes.

#biocLite("edgeR")
suppressPackageStartupMessages(library(edgeR))
dge <- DGEList(counts = txi$counts, genes = data.frame(gene.id = rownames(txi$counts)))
names(dge)
## [1] "counts"  "samples" "genes"

2.3.3 Gene information annotation

Using the WormbaseID gene ID contained in the ENSEMBL reference more gene information can be retrieved using the biomart package. Human readable gene symbols and entrez gene IDs are downloaded and assigned to the respective gene description in the DGEList object

# Add gene annotation information using biomart
suppressPackageStartupMessages(library(biomaRt))
ensembl <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", 
                   dataset="celegans_gene_ensembl", host = "www.ensembl.org")
convtable <- getBM(ensembl, attributes = c("ensembl_gene_id", "entrezgene", "wikigene_name"),
                   filters = "ensembl_gene_id", values = as.character(dge$genes$gene.id))
dge$genes$symbol <- convtable$wikigene_name[match(dge$genes$gene.id, convtable$ensembl_gene_id)]
dge$genes$entrez <- convtable$entrezgene[match(dge$genes$gene.id, convtable$ensembl_gene_id)]

head(dge$genes)

2.3.4 Library size normalization

Since the different sequencing experiments vary in their library size it is important to account for these differences in sequencing depth by normalizing on the library size. The normalization factors are written to the sample description in the DGEList object.

dge <- calcNormFactors(dge) # calculate normalization factors for later expression-level filtering
head(dge$samples, n= 5)

2.3.5 Meta information

To be able to interpret the data we need to assign the biological conditions to the samples. This is done by merging a meta information dataframe with the sample information contained in the DGEList object based on the sample Run identifier. Furthermore, all meta information is condensed in an additional identifier. This identifier allows the grouping of all samples based on a single column - which is very useful for later contrast definition. This approach is outlined in the edgeR manual section 3.3.1: “Defining each treatment combination as a group”

# Add experimental conditions: normal (N2) to long-lived (eat-2)
meta <- data.frame("Run" = names(files),
              "Strain" = c(rep(c("eat2","N2"),each = 8),rep(c("eat2","N2"),each = 8)), 
              "RNAi" = rep(c("ev","sfa1","ev","sfa1","ev","sfa1","ev","sfa1"),each = 4),
              "Age" = c(rep("15",16),rep("3",16)))

ID <- factor(paste(meta$Strain,meta$RNAi,meta$Age,sep="."))
meta <- data.frame(meta,ID)

dge$samples <- merge(dge$samples, meta, by.x = 0, by.y = "Run", all = TRUE)
dge$samples[seq(1,20,by=4),]

2.3.6 Filtering lowly expressed genes

The current count table 46778 genes are listed. For many of these genes, however, we do not have sufficient information to draw conlcusions whether they are differentially expressed between samples or expressed at all. On biological grounds, a gene has to be expressed at a minium level to be able to have an effect on the organism and genes which are not expressed should not be considered. From a statistical viewpoint, genes which are consitently lowly or not expressed across conditions do not add information to the differential expression analysis Chen, Lun, and Smyth (2016).

To be able to filter with the same stringency across all samples we use the counts per million and not the absolut count information. As a cutoff we select a cpm value of larger than 0.5 in at least two samples / libraries.

After this filtering step the library normalization factors have to be calculated again since the number of counts has changed.

cpm <- cpm(dge)
dge <- dge[which(rowSums(cpm > 0.5) > 2), , keep.lib.sizes = FALSE] # remove lowly expressed genes:
dge <- calcNormFactors(dge) # calculate normalization factors again now the data was filtered
nrow(dge$counts)
## [1] 20994
# 20994

2.3.7 MDS plot of expression data

An MDS plot structures shows similarities and dissimilarities between samples by computing sample similarities in an unsupervised manner. Using an MDS plot one can already estimate the extent to which samples will show differential expression between each other based on their clustering pattern in the plot. The first dimension is able to explain the variation in the data the best and the other dimensions have a smaller effect.

In the day 3 samples clustering is driven by strain and in day 15 it is governed mostly by the RNAi condition and not by strain.

par(mfrow = c(1,2),mar = c(7,4,4,2),mai = c(1.2,1,1,0.5))

d3 <- dge[,dge$samples$Age ==3]
plotMDS(d3,
        labels = d3$samples$Strain,
        col = as.numeric(factor(d3$samples$RNAi)),
        main = "day 3")
legend("bottomleft",c("ev","sfa-1"),col = c(1,2), bty = "n",pch = 15)
mtext("Sample clustering is driven by strain while RNAi state \n does not appear to have an effect ",side = 1, line = 5,cex = 0.8,outer = FALSE)


d15 <- dge[,dge$samples$Age == 15]
plotMDS(d15,
        labels = d15$samples$Strain,
        col = as.numeric(factor(d15$samples$RNAi)),
        main = "day 15")
legend("bottomleft",c("ev","sfa-1"),col = c(1,2), bty = "n", pch = 15)

mtext("The sample clustering is governed mostly by the \n RNAi condition and not by strain.",side = 1, line = 5,cex = 0.8,outer = FALSE)

2.4 Design matrix and contrasts

2.4.1 Design matrix

We construct here a design matrix without intercept. Based on the later definition of the contrasts we must not use an intercept to be able to directly compare all groups with each other. For this analysis the entire sample information is summarized in one combined meta table column. Every column of the design matrix represents one condition. The first are displayed below.

design <- model.matrix(~0+ID)
colnames(design) <- levels(ID)
dge <- estimateDisp(dge, design = design)
head(design, n=5)
##   eat2.ev.15 eat2.ev.3 eat2.sfa1.15 eat2.sfa1.3 N2.ev.15 N2.ev.3
## 1          1         0            0           0        0       0
## 2          1         0            0           0        0       0
## 3          1         0            0           0        0       0
## 4          1         0            0           0        0       0
## 5          0         0            1           0        0       0
##   N2.sfa1.15 N2.sfa1.3
## 1          0         0
## 2          0         0
## 3          0         0
## 4          0         0
## 5          0         0

2.4.2 Contrasts

Contrasts can defined by substracting the two desired columns in the design matrix from each other. If multiple columns are compared they can be added within brackets and then substracted from the other column or column group. One has to be careful to divide the groups by the number of their members if they are not equal between groups.

Contrasts are constructed by comparing columns of the design matrix. In the example below the design_matrix_column_A is compared to the design_matrix_column_B. The order defines the sign (+/-) of the reported expression change between the two samples. A positive fold change describes a gene which is up-regulated in design_A relative to design_B. In oder words, the order is crucial when defining contrasts and the sign is always taking the perspective of the first specified design column: logFC of +2.8 means that the gene is 2\(^2.8\) = 7.0 times higher expressed in design_A compared to design_B.

\[contrast_1 = design_A - design_B\]

# contrast definition
my.cont <- makeContrasts(
  # within the same age cohort
  RNAi_eat2_15 = eat2.ev.15 - eat2.sfa1.15,
  RNAi_eat2_3 = eat2.ev.3 - eat2.sfa1.3,
  RNAi_N2_3 = N2.ev.3 - N2.sfa1.3, # as ctr, the MDS plot indicates no difference
  RNAi_N2_15 = N2.ev.15 - N2.sfa1.15,
  Strain_ev_15 = eat2.ev.15 - N2.ev.15,
  Strain_ev_3 = eat2.ev.3 - N2.ev.3,
  Strain_sfa1_15 = eat2.sfa1.15 - N2.sfa1.15,
  Strain_sfa1_3 = eat2.sfa1.3 - N2.sfa1.3,
  # comparisons between different age cohorts
  Strain_ev = (eat2.ev.15 + eat2.ev.3)/2 - (N2.ev.15 + N2.ev.3)/2, # Strain difference
  d15.vs.d3_N2_ev = N2.ev.15 - N2.ev.3,                # to compare with supplementary information table 4, ev
  d15.vs.d3_N2_sfa1 = N2.sfa1.15 - N2.sfa1.3,          # to compare with supplementary information table 4, sfa1
  d15.vs.d3_eat2_ev = eat2.ev.15 - eat2.ev.3,          # to compare with supplementary information table 5, ev
  d15.vs.d3_eat2_sfa1 = eat2.sfa1.15 - eat2.sfa1.3,    # to compare with supplementary information table 5, sfa1
  # Other contrasts
  N2_vs_eat2_ev_15 = N2.ev.15 - eat2.ev.15, #changes in N2 relative to eat2
  sfa1_vs_ev_eat2_15 = eat2.sfa1.15 - eat2.ev.15, #changes in sfa-1 relative to ev
  levels=design)
  • Side note on more complex contrasts: This analysis for example: Strain_ev = (eat2.ev.15 + eat2.ev.3)/2 - (N2.ev.15 + N2.ev.3)/2 asks what genes are differentially expressed between the average of (eat2.ev.15 + eat2.ev.3) and the average of (N2.ev.15 + N2.ev.3). This means that if a gene is highly expressed in eat2.ev.3 but not under any other condition it will be reported. If one wants to be more stringent and only look for genes differentially expressed in both eat2.ev.15 AND eat2.ev.3 compared to N2.ev.15 AND N2.ev.3 one has to work with intersect terms. These can be applied after the analysis by performing logical operations (AND, OR) on the decideTest() output, which is more stringent.

2.4.3 Generalized linear model

A generalized linear model (glm) is used to analyze the data. GLMs are implemented in the edgeR package and can be applied to mulifactor experiments of all complexities and were developped in a cooperative effort by McCarthy, Chen, and Smyth (2012), Lund et al. (2012), Chen, Lun, and Smyth (2014) and Lun, Chen, and Smyth (2016). The edgeR package also contains exact models developped by Robinson and Smyth (2007a) and Robinson and Smyth (2007b) in addition to the GLM models. Both the exact models and the GLMs are empirical Bayes methods, which allow researchers to estimate variation. Here we use quasi-likelihood F-tests developped by Lund et al. (2012) and Lun, Chen, and Smyth (2016).

# Assigning a folder for differential expression tables
system(sprintf("mkdir -p %s/FCtables", output_dir))

## Quasi-likelihood F-test: starting with glmQLFit
suppressPackageStartupMessages(library(statmod))
fit <- glmQLFit(dge, design = design, robust = TRUE)

Before we go forward, we plot the p-value histograms for all contrasts. This is to analyze the p-value distribution for the individual Null hypothesis and to gain insights whether we have an indication that the Null hypothesis is wrong.

2.4.3.0.1 Control comparison, difference between RNAi treatments for young wild-type animals

As expected, the p-values for the RNAi_N2_3 contrast follow a normal distribution since the two samples that are compared here cluster closely together in the MDS plot. This is expected and we can go forward and compare other contrasts for wich we expect to see an expression difference according to the MDS plots.

2.4.3.0.2 Comparisons within the same age group

The p-value histograms are shown for contrasts between RNAi and DR treatments always withing the same age cohort. With the exception of the histogram shown in the previous section we can see a anti-conservative distributions for all contrasts. This indicates that the alternative is likely true for these contrasts. Interestingly, one can observe that the percentage of genes for which the alternative is true is higher at day 15 than dqy 3 - this could explain why Heintz et al. (2017) preferentially focussed on this age cohort for their analysis.

2.4.3.0.3 Comparisons between different age groups

The percentage of genes for which the alternative appears to be true is higher for age comparisons (d15.vs.d3_X_Y) compared to the within-age-group comparisons shown in the previous histograms. For these 4 the authors supplied their log-fold change comparisons in their Supplementary Information section.

Both the within and across age group comparisons are evaluated below.

3 Effect of DR and sfa-1 RNAi on gene expression

The plots shown below are in agreement with the p-value distributions shown before, indicating a high percentage of differentially expressed genes for the contrasts: RNAi treatments for both DR and N2 animals at day 15 while the same comparison at day 3 does not show a large expression change. Furthermore, the two strains (normal and dietary restricted) show only a difference at day 15 for control RNAi. Under sfa-1 treatment they already differ at day 3.

## Zero log2-FC threshold detected. Switch to glmQLFTest() instead.
## Zero log2-FC threshold detected. Switch to glmQLFTest() instead.
## Zero log2-FC threshold detected. Switch to glmQLFTest() instead.
## Zero log2-FC threshold detected. Switch to glmQLFTest() instead.
## Zero log2-FC threshold detected. Switch to glmQLFTest() instead.
## Zero log2-FC threshold detected. Switch to glmQLFTest() instead.
## Zero log2-FC threshold detected. Switch to glmQLFTest() instead.
## Zero log2-FC threshold detected. Switch to glmQLFTest() instead.

3.1 What aspect of DR is reversed by sfa-1 RNAi?

Heintz et al. (2017) detail that: " To define the effects of dietary restriction that are specifically mod- ulated by SFA-1, we first analysed differential gene expression changes between animals fed ad libitum and animals on dietary restriction at day 15, with and without sfa-1 RNAi. We then determined the KEGG pathways significantly altered by dietary restriction in day 15 animals in an SFA-1-dependent manner. RNAi of sfa-1 did not block all dietary restriction-related changes to gene expression, but instead specifically reversed upregulation of lipid/fatty acid metabolism genes induced by dietary restriction "

Author’s figure 3.a, gene expression profiles of middle-aged, day 15 C.elegans. The green box highlights the KEGG pathways which are upregulated under DR condition but are downregulated back to wild type levels when treated with sfa-1 RNAi

Author’s figure 3.a, gene expression profiles of middle-aged, day 15 C.elegans. The green box highlights the KEGG pathways which are upregulated under DR condition but are downregulated back to wild type levels when treated with sfa-1 RNAi

To reproduce this claim we take two different approaches.

  • In the first approach, the intersect of all downregulated genes in the two contrasts: N2_vs_eat2_ev_15 and sfa1_vs_ev_eat2_15 is taken (FDR cutoff: p-value < 0.1) and the KEGG pathways over-represented among these genes is analyzed.

  • In a second analysis, the KEGG pathway IDs that are observed to be downregulated for each contrast (N2_vs_eat2_ev_15, sfa1_vs_ev_eat2_15) (FDR cutoff: p-value < 0.1) are isolated and only afterwards the intersection is taken.

By reading the publication, it was expected, that the second protocol is more in agreement with the author’s approach. However, the lack of overlap between the two approaches is astonishing.

3.1.0.1 First intersect, then KEGG pathway enrichment

tr_N2 <- glmTreat(fit, contrast=my.cont[,"N2_vs_eat2_ev_15"]) # changes from DR to no DR
## Zero log2-FC threshold detected. Switch to glmQLFTest() instead.
is.de_N2 <- decideTestsDGE(tr_N2, adjust.method="BH", p.value=0.1, lfc=0)
summary(is.de_N2)
##        -1*eat2.ev.15 1*N2.ev.15
## Down                       2216
## NotSig                    17219
## Up                         1559
tr_sfa1 <- glmTreat(fit, contrast=my.cont[,"sfa1_vs_ev_eat2_15"]) # changes from DR to DR + sfa-1 RNAi
## Zero log2-FC threshold detected. Switch to glmQLFTest() instead.
is.de_sfa1 <- decideTestsDGE(tr_sfa1, adjust.method="BH", p.value=0.1, lfc=0)
summary(is.de_sfa1)
##        -1*eat2.ev.15 1*eat2.sfa1.15
## Down                           1241
## NotSig                        19341
## Up                              412

The heatmaps illustrate the change in expression pattern that occurs between No DR relative to DR and DR under sfa-1 RNAi relative to DR (day 15). The benefit of using a heatmap is that one can appreciate how the genes that change most in expression under the contrast of interest change across other samples. In the heatmap, genes with high positive Pearson correlation coefficients between their logCPM values will be in close proximity while others with lower correlation coefficients will be placed further away. For both heatmaps the 30 genes with the lowest p-values were used. Interestingly, these genes we selected that are most robustly differentially expressed behave very different at the other timepoint (day 3). This can be observed by the first dendrogram split being according to the sample age for both contrasts. Interestingly, these genes change their expression level more from young to middle-aged adult than whether they are restricted in their caloric intake. Furthermore, we see that the DR expression pattern of DR under the sfa-1 RNAi is not changed back to the no DR expression pattern but is very distinct.

library(viridis) #A very beautiful color palette for heatmaps
## Loading required package: viridisLite
suppressPackageStartupMessages(library(gplots)) # needed for the two heatmaps
cpm <- cpm(dge)
rownames(cpm) <- dge$genes$symbol
colnames(cpm) <- dge$samples$ID

# heatmap 1
ordered <- order(tr_N2$table$PValue)
h1 <- cpm[ordered[1:30],]
h1 <- t(scale(t(h1)))
heatmap.2(h1, col=viridis, Rowv=TRUE, scale="none", trace="none", dendrogram="column", cexRow=0.5, cexCol=0.5, density.info="none", main = "No DR rel. to DR (d15)",xlab = "Biological samples [4 replicates each]")

# heatmap 2
ordered <- order(tr_sfa1$table$PValue)
h2 <- cpm[ordered[1:30],]
h2 <- t(scale(t(h2)))
heatmap.2(h2, col=viridis, Rowv=TRUE, scale="none", trace="none", dendrogram="column", cexRow=0.5, cexCol=0.5, density.info="none", main = "DR under sfa-1 RNAi rel. to DR (d15)",xlab = "Biological samples [4 replicates each]")

We continue by determining which genes are down-regulated in both contrasts:

bool_down <- is.de_N2 ==-1 & is.de_sfa1==-1 # what genes are less expressed in both changes?
paste(sum(bool_down), "genes are downregulated in both contrasts")
## [1] "239 genes are downregulated in both contrasts"
genes_down <- fit$genes$gene.id[bool_down]
write.table(genes_down, file = paste0(output_dir, "/FCtables/", "genes_down.txt"),
              row.names = FALSE, col.names = TRUE,
              quote = FALSE, sep = "\t")

The The 239 genes were submitted to the DAVID (Huang, Sherman, and Lempicki 2008b, Huang, Sherman, and Lempicki (2008a)) webservice and the significantly enriched pathways (FDR cutoff: p-value < 0.1) were: lysosome and not as expected fatty acid processing.

# DAVID was utilized via the web interface because the R package RDAVIDWebService was not functioning properly. All files are read and saved to and from excel files and the Wormbase IDs of the top 2000 most differentially expressed genes are manually entered into the DAVID web service.
suppressPackageStartupMessages(library(readxl))
DAVID_genes_down <- read_excel(paste(output_dir,"/FCtables/DAVID_genes_down.xlsx",sep = "")) #N2_ev
DAVID_genes_down_ord <-  DAVID_genes_down[order(DAVID_genes_down$Benjamini),]
DAVID_genes_down_ord[,c("Term","Benjamini")][DAVID_genes_down_ord$Benjamini < 0.1,]

This is not in agreement with what Heintz et al. (2017) discribe. This is a valuable lesson that the sequence in which enrichment and intersect are determined are critical. One explanation is that many pathway components are moderately differentially regulated in each of the two contrasts. If the KEGG pathway enrichment is performed directly, these genes can contribute to the pathway enrichment. However, if the intersect of tthe differentially expressed genes is taken first the moderately but on their own not significantly differentially expressed genes are not able to contribute to the pathway enrichment and we obtain a very different result.

A GO term analysis indicates that members within the 239 genes are involved in metabolism. However, the GO terms are very vague.

suppressPackageStartupMessages(library(celegans.db))
suppressPackageStartupMessages(library(clusterProfiler))
gx <- groupGO(as.character(genes_down),
        OrgDb = celegans.db,
        keytype =  'ENSEMBL',
        ont      = "BP",
        level    = 3,
        readable = TRUE)

barplot(gx, drop=TRUE, order=TRUE, showCategory=15)

3.1.1 Determining the KEGG pathways first and then taking the intersect of the two contrasts

suppressPackageStartupMessages(library(gage))

for (cont in c("d15.vs.d3_N2_ev","sfa1_vs_ev_eat2_15")){
  print(paste("contrast",cont,"is being processed"))
# for the individual contrasts
qlf <- glmQLFTest(fit, contrast=my.cont[,cont]) # my GLM fit and contrast of interest (I use edgeR upstream)
tt <- topTags(qlf, adjust.method="BH", sort.by="PValue", n = Inf)

# Prepare query and receive data from KEGG
fc=tt$table$logFC
names(fc)=tt$table$entrez
exp.fc=fc
species="cel"
kegg.gs <- kegg.gsets(species=species, id.type="entrez")$kg.sets # query and download latest KEGG datasets from the KEGG servers
fc.kegg.p <- gage(exp.fc, gsets = kegg.gs, ref = NULL, samp = NULL)

# extract the KEGG Pathway IDs --> specify if GREATER or LESS
suppressPackageStartupMessages(library(dplyr))
reg = data.frame(id=rownames(fc.kegg.p$less), fc.kegg.p$less) %>% 
  tbl_df() %>% filter(p.val < 0.1) %>% .$id %>% as.character()
#reg = substr(reg, start=1, stop=8) # extact KEGG ID

# save IDs to variable (named after contrast)
id <- paste("KEGG_IDs_",cont,sep = "")
assign(id, reg)
}
## [1] "contrast d15.vs.d3_N2_ev is being processed"
## [1] "contrast sfa1_vs_ev_eat2_15 is being processed"

The Kegg pathways which were found to be down-regulated (FDR cutoff: p-value < 0.1) in both contrasts separately are obtained and then compared between the two samples.

inters <- intersect(KEGG_IDs_d15.vs.d3_N2_ev, KEGG_IDs_sfa1_vs_ev_eat2_15) #KEGG pathways downregulated in both contrasts
authors <- c("Carbon","Fatty", "glycan", "Glycine", "serine", "threonine","Starch")
overlap <- unique (grep(paste(authors,collapse="|"), inters, value=TRUE))
overlap_IDs <- substr(overlap, start=1, stop=8) # extact KEGG ID
overlap #overlap between the author's analysis and the reanalysis
## [1] "cel01200 Carbon metabolism"                       
## [2] "cel00511 Other glycan degradation"                
## [3] "cel00260 Glycine, serine and threonine metabolism"
## [4] "cel00071 Fatty acid degradation"

Here we could obtain 12 significantly downregulated KEGG pathways under both contrasts. Even though KEGG pathway annotation is incomplete and often changing and virtually every component in the analysis pipeline was changed (for example: Salmon instead of STAR, edgeR instead of DESeq2) there is a very high overlap between the KEGG pathways I identified and what the authors describe. 33% of the pathways Heintz et al. (2017) showed in their figure overlapped with the significantly down-regulated pathways found in the re-analysis. Two of the four pathways with their specific expression changes are shown below.

suppressPackageStartupMessages(library(pathview))
for (i in overlap_IDs) {
  pathview(gene.data=exp.fc, pathway.id=i, species="cel", new.signature=FALSE) #run for several pathways
}

Glycine, serine and threonine metabolism (KEGG ID: cel00260) Fatty acid degradation (cel00071)

One can appreciate that of the expression changes that map to these pathways most are negative (colored green in the plots)

4 Gene expression change with age

## Zero log2-FC threshold detected. Switch to glmQLFTest() instead.
## Zero log2-FC threshold detected. Switch to glmQLFTest() instead.
## Zero log2-FC threshold detected. Switch to glmQLFTest() instead.
## Zero log2-FC threshold detected. Switch to glmQLFTest() instead.

The agreement beween the most differentially expressed genes found here for these four contrasts and the top candidates the authors identified in their study is adressed in this section.

5 Pipeline comparison

Here, we compare the top differentially expressed genes (sorted by p-value) between the two pipelines. The methods used in the two RNA sequencing pipelines are very different. Heintz et al. (2017) used STAR as an alignment-based method while I used SALMON’s quasi-alignment method and there are a plethora of other differences both in the programs and settings used in the analysis. In order to receive a quick and robust metric to compare the two pipelines we take the top 2000 differentially expressed genes (sorted by p-value) across four contrasts. For this analysis we can use the four raw differential expression tables the authors provide in the supplement. Since in the publication most conclusions are drawn from the KEGG pathway enrichment we subject all datasets to pathway enrichment and compare the most enriched KEGG ID’s. For the KEGG pathway enrichment analysis we use DAVID (Huang, Sherman, and Lempicki 2008b, Huang, Sherman, and Lempicki (2008a)) and select the species C. elegans as background.

In summary, one can observe that there is a significant overlap between both pipelines and all four differential expression analysis tests that were performed. The comparisons made here are middle-aged (day 15) adults relative to young adult C. elegans (day 3) subjected to sfa-1 RNAi treatment (50% knock-down efficiency Heintz et al. (2017)) and ev control both for wild type (N2) and long-lived dietary restricted (eat-2) genotypes.

The differential expression tables that are loaded to R here were either retrieved from the supplement of Heintz et al. (2017) or written to file in the generalized linear model section.

#all comparing d15 realtive to d3
SI_dir <- "/Users/cys/Desktop/STA426_project/Supplement/"
SI_N2_ev <- read_excel(paste(SI_dir,"SI_Tab4_ev_top2000_DAVID.xlsx",sep = "")) #N2_ev
SI_N2_sfa1 <- read_excel(paste(SI_dir,"SI_Tab4_sfa1_top2000_DAVID.xlsx",sep = "")) #N2_sfa1
SI_eat2_ev <- read_excel(paste(SI_dir,"SI_Tab5_ev_top2000_DAVID.xlsx",sep = "")) #eat2_ev
SI_eat2_sfa1 <- read_excel(paste(SI_dir,"SI_Tab5_sfa1_top2000_DAVID.xlsx",sep = "")) #eat2_sfa1

# From my FASTQ reanalysis
RA_dir <- "/Users/cys/Desktop/STA426_project/Salmon_output/FCtables/"
My_N2_ev <- read_excel(paste(RA_dir,"d15.vs.d3_N2_ev_top2000_DAVID.xlsx",sep = ""))
My_N2_sfa1 <- read_excel(paste(RA_dir,"d15.vs.d3_N2_sfa1_top2000_DAVID.xlsx",sep = ""))
My_eat2_ev <- read_excel(paste(RA_dir,"d15.vs.d3_eat2_ev_top2000_DAVID.xlsx",sep = ""))
My_eat2_sfa1 <- read_excel(paste(RA_dir,"d15.vs.d3_eat2_sfa1_top2000_DAVID.xlsx",sep = ""))

5.0.0.1 Age comparison for wild type (N2) subjected to ev and sfa-1 RNAi treatment

suppressPackageStartupMessages(library(VennDiagram))
venn.plot <- venn.diagram(
x = list(substr(SI_N2_ev$Term, start=1, stop=8),
         substr(My_N2_ev$Term, start=1, stop=8)
         ),
filename = "d15.vs.d3_N2_ev.png",cex = 2,cat.fontface = 4,
resolution = 500,lty =1,imagetype = "png",
fill = c("chartreuse4", "chartreuse2"),
alpha = c(0.8, 0.1), 
category.names = c("Author's N2 ev","Reanalysis N2 ev"))
Kegg pathway enrichment comparison between the two pipelines for wild-type C. elegans subjected to control RNAi. The expression change of middle-aged animals relative to young animals are quantified

Kegg pathway enrichment comparison between the two pipelines for wild-type C. elegans subjected to control RNAi. The expression change of middle-aged animals relative to young animals are quantified

venn.diagram(
x = list(substr(SI_N2_sfa1$Term, start=1, stop=8),
         substr(My_N2_sfa1$Term, start=1, stop=8)
         ),
filename = "d15.vs.d3_N2_sfa1.png",cex = 2,cat.fontface = 4,lty =1,
resolution = 500,imagetype = "png",fill = c("brown3", "brown1"),
alpha = c(0.8, 0.1), 
category.names = c("Author's N2 sfa1","Reanalysis N2 sfa1"))
Kegg pathway enrichment comparison between the two pipelines for wild-type C. elegans subjected to sfa-1 RNAi. The expression change of middle-aged animals relative to young animals are quantified

Kegg pathway enrichment comparison between the two pipelines for wild-type C. elegans subjected to sfa-1 RNAi. The expression change of middle-aged animals relative to young animals are quantified

venn.diagram(
x = list(substr(SI_N2_ev$Term, start=1, stop=8),       #left-most
         substr(SI_N2_sfa1$Term, start=1, stop=8),     #right-most
         substr(My_N2_ev$Term, start=1, stop=8),       #middle-left
         substr(My_N2_sfa1$Term, start=1, stop=8)      #middle-right
         ),
filename = "d15.vs.d3_N2_ev_AND_sfa1.png",cex = 2,cat.fontface = 4,lty =1,
resolution = 500,imagetype = "png",
fill = c("chartreuse4", "brown3","chartreuse2","brown1"),
alpha = c(0.8, 0.8, 0.1, 0.1), 
category.names = c("Author's N2 ev", "Author's N2 sfa1", "Reanalysis N2 ev", "Reanalysis N2 sfa1"))
Combined Venn Diagram for both RNAi treatments of wild-type C. elegans

Combined Venn Diagram for both RNAi treatments of wild-type C. elegans

5.0.0.2 Age comparison for eat-2, DR (with ev and sfa-1 RNAi)

venn.plot <- venn.diagram(
x = list(substr(SI_eat2_ev$Term, start=1, stop=8),
         substr(My_eat2_ev$Term, start=1, stop=8)
         ),
filename = "d15.vs.d3_eat2_ev.png",cex = 2,cat.fontface = 4,lty =1,
resolution = 500,imagetype = "png",
fill = c("chartreuse4", "chartreuse2"),
alpha = c(0.8, 0.1), 
category.names = c("Author's DR ev","Reanalysis DR ev"))
Kegg pathway enrichment comparison between the two pipelines for dietary restricted C. elegans subjected to control RNAi. The expression change of middle-aged animals relative to young animals are quantified. These animals are long-lived compared to wild type

Kegg pathway enrichment comparison between the two pipelines for dietary restricted C. elegans subjected to control RNAi. The expression change of middle-aged animals relative to young animals are quantified. These animals are long-lived compared to wild type

venn.diagram(
x = list(substr(SI_eat2_sfa1$Term, start=1, stop=8),
         substr(My_eat2_sfa1$Term, start=1, stop=8)
         ),
filename = "d15.vs.d3_eat2_sfa1.png",cex = 2,cat.fontface = 4,lty =1,
resolution = 500,imagetype = "png",
fill = c("brown3", "brown1"),
alpha = c(0.8, 0.1), 
category.names = c("Author's DR sfa1","Reanalysis DR sfa1"))
Kegg pathway enrichment comparison between the two pipelines for dietary restricted C. elegans subjected to control RNAi. The expression change of middle-aged animals relative to young animals are quantified. Due to the sfa-1 knock-down these animals are not long-lived

Kegg pathway enrichment comparison between the two pipelines for dietary restricted C. elegans subjected to control RNAi. The expression change of middle-aged animals relative to young animals are quantified. Due to the sfa-1 knock-down these animals are not long-lived

venn.diagram(
x = list(substr(SI_eat2_ev$Term, start=1, stop=8),       #left-most
         substr(SI_eat2_sfa1$Term, start=1, stop=8),     #right-most
         substr(My_eat2_ev$Term, start=1, stop=8),       #middle-left
         substr(My_eat2_sfa1$Term, start=1, stop=8)      #middle-right
         ),
filename = "d15.vs.d3_eat2_ev_AND_sfa1.png",cex = 2,cat.fontface = 4,lty =1,
resolution = 500,imagetype = "png",
fill = c("chartreuse4", "brown3","chartreuse2","brown1"),
alpha = c(0.8, 0.8, 0.1, 0.1), 
category.names = c("Author's DR ev", "Author's DR sfa1", "Reanalysis DR ev", "Reanalysis DR sfa1"))
Combined Venn Diagram for both RNAi treatments of dietary restricted C. elegans

Combined Venn Diagram for both RNAi treatments of dietary restricted C. elegans

5.0.1 Reproduction of sfa-1 dependency

In the previous section we determined the KEGG pathways significantly altered by dietary restriction in day 15 animals in an SFA-1-dependent manner. In a further analysis Heintz et al. (2017) show that RNAi of sfa-1 did not block all dietary restriction-related changes to gene expression, but instead specifically reversed upregulation of lipid/fatty acid metabolism genes induced by dietary restriction. This can be seen in the boxed region in figure 3a. Furthermore, in the extended data figures 6e and 6f this is also discussed.

6 Excursion into splicing analysis

6.0.1 Technical aspect: HPC environment (Euler)

6.0.1.0.0.1 Transferred to Euler SCRATCH directory

The trimmed fastq files were uploaded in batch to my scratch directory on the EULER cluster as well as the reference files required by STAR. Here, an example code is shown on how to upload files to the cluster.

scp -r /Users/cys/Desktop/STA426_project/Reference_FASTA/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa statzerc@euler.ethz.ch:/cluster/scratch/statzerc/genome

6.0.2 STAR: Indexing the C. elegans genome

The C. elegans genome was downloaded from Ensembl.

Indexing of the C.elegans genome was performed in the following way and ran successfully to completion. The code utilized in the command line is depicted as understandable as possible.

6.0.3 STAR: Aligning the reads

Alignment for each paired end sample was performed as described below. In this example, an excessive amount of memory and CPU power was requested, which was due to problems with the BAM file sorting before and was afterwards reduced again.

# on Euler:
mkdir /cluster/scratch/statzerc/alignment7
cd /cluster/scratch/statzerc/alignment7

# creating a shell script into which the indented commands are written and saved
vi AlignmentScript7.sh
  module load star/2.5.3a
  cd /cluster/scratch/statzerc/alignment7
  echo "starting alignment with STAR"
  STAR --runThreadN 24 --runMode alignReads --genomeDir /cluster/scratch/statzerc/genome/ --alignIntronMax 50000 --sjdbGTFfile
  /cluster/scratch/statzerc/genome/Caenorhabditis_elegans.WBcel235.91.gtf --sjdbOverhang 100 --readFilesIn /cluster/scratch/statzerc/trimmed/trimmed_ERR1474693_1.fastq
  /cluster/scratch/statzerc/trimmed/trimmed_ERR1474693_2.fastq --outSAMtype BAM Unsorted SortedByCoordinate --outSAMstrandField intronMotif --outFilterType BySJout --quantMode
  GeneCounts

# The job was submitted using the job scheduling system
bsub -W 03:00 -n 24 -R "rusage[mem=2048]" -o output_file7_newParam.txt < AlignmentScript7.sh

Successfully completed.

Resource usage summary:

CPU time :                                   2139.53 sec.
Max Memory :                                 25627 MB
Average Memory :                             15733.22 MB
Total Requested Memory :                     49152.00 MB
Delta Memory :                               23525.00 MB
Max Swap :                                   -
Max Processes :                              4
Max Threads :                                51
Run time :                                   793 sec.
Turnaround time :                            243959 sec.

starting alignment with STAR Jan 08 13:46:48 ….. started STAR run Jan 08 13:46:48 ….. loading genome Jan 08 13:47:52 ….. processing annotations GTF Jan 08 13:48:28 ….. inserting junctions into the genome indices Jan 08 13:48:47 ….. started mapping Jan 08 13:59:07 ….. started sorting BAM Jan 08 13:59:49 ….. finished successfully

6.0.4 Samtools merging bam files

The coordinate sorted bam files of the same biological condition were then merged using samtools. Below the merging of two sorted bam files is shown. The large difference in requested and used memory is again due to storage errors. This issue was later resolved. The storage problem had to do with Euler’s scratch memory infrastructure and not the submitted job. After this, the cpu, memory and walltime was reduced greatly.

# on Euler:
mkdir /cluster/scratch/statzerc/samtool_merge
cd /cluster/scratch/statzerc/samtool_merge

# creating a shell script into which the indented commands are written and saved
vi samtools_merge.sh
  module load samtools
  samtools merge merged_al5_al6.bam /cluster/scratch/statzerc/alignment5/Aligned.sortedByCoord.out.bam /cluster/scratch/statzerc/alignment6/Aligned.sortedByCoord.out.bam

# The job was submitted using the job scheduling system
bsub -W 02:00 -n 24 -R "rusage[mem=2048]" -o output_merging_2.txt < samtools_merge.sh

Successfully completed.

Resource usage summary:

CPU time :                                   604.78 sec.
Max Memory :                                 4342 MB
Average Memory :                             2129.66 MB
Total Requested Memory :                     49152.00 MB
Delta Memory :                               44810.00 MB
Max Swap :                                   -
Max Processes :                              4
Max Threads :                                5
Run time :                                   888 sec.
Turnaround time :                            244120 sec.

6.0.5 Splicing analysis using SAJR

In the publication of Heintz et al. (2017) the splicing analysis has been performed using SAJR (Mazin et al. 2013). All files including the working demo example were downloaded. Intron retention analysis could be performed very easily on the provided demo files (Mazin et al. 2013). However, the analysis could not be adapted to C. elegans using either the gtf gene sets file as described in the publication by Heintz et al. (2017) or the gff3 gene sets as recommended by Mazin et al. (2013). Unfortunately, Mazin et al. (2013) did not respond to the email in which this problem was detailed and even after hours of testing the pipeline could not be adapted. A major drawback was that the body of the SAJR read counter program is written in Java and that there is very little online help available. Nevertheless, quantifying abarrant splicing events is very important and extremely interesting. Hopefully, further information will become available on how to use it on C.elegans samples.

References

Anders, Simon, Paul Theodor Pyl, and Wolfgang Huber. 2015. “HTSeq—a Python Framework to Work with High-Throughput Sequencing Data.” Bioinformatics 31 (2). Oxford University Press: 166–69.

Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society. Series B (Methodological). JSTOR, 289–300.

Chen, Yunshun, Aaron TL Lun, and Gordon K Smyth. 2014. “Differential Expression Analysis of Complex Rna-Seq Experiments Using edgeR.” In Statistical Analysis of Next Generation Sequencing Data, 51–74. Springer.

———. 2016. “From Reads to Genes to Pathways: Differential Expression Analysis of Rna-Seq Experiments Using Rsubread and the edgeR Quasi-Likelihood Pipeline.” F1000Research 5. Faculty of 1000 Ltd.

Dobin, Alexander, Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R Gingeras. 2013. “STAR: Ultrafast Universal Rna-Seq Aligner.” Bioinformatics 29 (1). Oxford University Press: 15–21.

Heintz, Caroline, Thomas K Doktor, Anne Lanjuin, Caroline C Escoubas, Yue Zhang, Heather J Weir, Sneha Dutta, et al. 2017. “Splicing Factor 1 Modulates Dietary Restriction and Torc1 Pathway Longevity in c. Elegans.” Nature 541 (7635). Nature Research: 102–6.

Huang, Da Wei, Brad T Sherman, and Richard A Lempicki. 2008a. “Bioinformatics Enrichment Tools: Paths Toward the Comprehensive Functional Analysis of Large Gene Lists.” Nucleic Acids Research 37 (1). Oxford University Press: 1–13.

———. 2008b. “Systematic and Integrative Analysis of Large Gene Lists Using David Bioinformatics Resources.” Nature Protocols 4 (1). Nature Publishing Group: 44.

Kenyon, Cynthia J. 2010. “The Genetics of Ageing.” Nature 464 (7288). Nature Research: 504–12.

Kenyon, Cynthia, Jean Chang, Erin Gensch, Adam Rudner, and Ramon Tabtiang. 1993. “A c. Elegans Mutant That Lives Twice as Long as Wild Type.” Nature 366 (6454). Springer: 461–64.

Krämer, A. 1992. “Purification of Splicing Factor Sf1, a Heat-Stable Protein That Functions in the Assembly of a Presplicing Complex.” Molecular and Cellular Biology 12 (10). Am Soc Microbiol: 4545–52.

Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for Rna-Seq Data with Deseq2.” Genome Biology 15 (12). BioMed Central: 550.

López-Otín, Carlos, Maria A Blasco, Linda Partridge, Manuel Serrano, and Guido Kroemer. 2013. “The Hallmarks of Aging.” Cell 153 (6). Elsevier: 1194–1217.

Lun, Aaron TL, Yunshun Chen, and Gordon K Smyth. 2016. “It’s de-Licious: A Recipe for Differential Expression Analyses of Rna-Seq Experiments Using Quasi-Likelihood Methods in edgeR.” Statistical Genomics: Methods and Protocols. Springer, 391–416.

Lund, Steven P, Dan Nettleton, Davis J McCarthy, and Gordon K Smyth. 2012. “Detecting Differential Expression in Rna-Sequence Data Using Quasi-Likelihood with Shrunken Dispersion Estimates.” Statistical Applications in Genetics and Molecular Biology 11 (5).

Luo, Weijun, Michael S Friedman, Kerby Shedden, Kurt D Hankenson, and Peter J Woolf. 2009. “GAGE: Generally Applicable Gene Set Enrichment for Pathway Analysis.” BMC Bioinformatics 10 (1). BioMed Central: 161.

Martin, Marcel. 2011. “Cutadapt Removes Adapter Sequences from High-Throughput Sequencing Reads.” EMBnet. Journal 17 (1): pp–10.

Mazin, Pavel, Jieyi Xiong, Xiling Liu, Zheng Yan, Xiaoyu Zhang, Mingshuang Li, Liu He, et al. 2013. “Widespread Splicing Changes in Human Brain Development and Aging.” Molecular Systems Biology 9 (1). EMBO Press: 633.

McCarthy, Davis J, Yunshun Chen, and Gordon K Smyth. 2012. “Differential Expression Analysis of Multifactor Rna-Seq Experiments with Respect to Biological Variation.” Nucleic Acids Research 40 (10). Oxford University Press: 4288–97.

Patro, Rob, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford. 2017. “Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression.” Nature Methods. Nature Research.

Robinson, Mark D, and Gordon K Smyth. 2007a. “Moderated Statistical Tests for Assessing Differences in Tag Abundance.” Bioinformatics 23 (21). Oxford University Press: 2881–7.

———. 2007b. “Small-Sample Estimation of Negative Binomial Dispersion, with Applications to Sage Data.” Biostatistics 9 (2). Oxford University Press: 321–32.

Weindruch, Richard, and Roy L Walford. 1988. Retardation of Aging and Disease by Dietary Restriction. CC Thomas.

Young, Matthew D, Matthew J Wakefield, Gordon K Smyth, and Alicia Oshlack. 2010. “Gene Ontology Analysis for Rna-Seq: Accounting for Selection Bias.” Genome Biology 11 (2). BioMed Central: R14.